Libraries
library(FactoMineR)
library(tidyr)
library(dplyr)
library(tidyverse)
library(magrittr)
library(ggplot2)
library(ggpubr)
library(factoextra)
library(gridExtra)
library(moments)
Screw Caps Data
raw_data <- read.table("ScrewCaps.csv",header=TRUE, sep=",", dec=".", row.names=1)
head(raw_data)
summary(raw_data)
Supplier        Diameter        weight         nb.of.pieces    Shape        Impermeability  Finishing         Mature.Volume   Raw.Material  Price           Length
Supplier A: 31  Min.   :0.4458  Min.   :0.610  Min.   : 2.000  Shape 1:134  Type 1:172      Hot Printing: 62  Min.   :  1000  ABS: 21       Min.   : 6.477  Min.   : 3.369
Supplier B:150  1st Qu.:0.7785  1st Qu.:1.083  1st Qu.: 3.000  Shape 2: 45  Type 2: 23      Lacquering  :133  1st Qu.: 15000  PP :148       1st Qu.:11.807  1st Qu.: 6.161
Supplier C: 14  Median :1.0120  Median :1.400  Median : 4.000  Shape 3:  8                                    Median : 45000  PS : 26       Median :14.384  Median : 8.086
                Mean   :1.2843  Mean   :1.701  Mean   : 4.113  Shape 4:  8                                    Mean   : 96930                Mean   :16.444  Mean   :10.247
                3rd Qu.:1.2886  3rd Qu.:1.704  3rd Qu.: 5.000                                                 3rd Qu.:115000                3rd Qu.:18.902  3rd Qu.:10.340
                Max.   :5.3950  Max.   :7.112  Max.   :10.000                                                 Max.   :800000                Max.   :46.610  Max.   :43.359
2) We start with univariate and bivariate descriptive statistics. Using appropriate plot(s) or summaries answer the following questions.
a) How is the distribution of the Price? Comment your plot with respect to the quartiles of the Price.
From the quantile output below, the median, first quartile (Q1) and third quartile (Q3) of Price are 14.384, 11.807 and 18.902 respectively.
The density plot suggests that Price follows a bimodal distribution that is skewed to the right: the major mode is around 14 and the antimode is around 29. The skewness (1.71 > 0) confirms the right skew, and the kurtosis (6.40, against 3 for a normal distribution) indicates heavy tails. Furthermore, 50% of the prices lie between 11.807 and 18.902. This is consistent with the graph, where most of the density is concentrated inside this range, with a long right tail of prices beyond it.
The boxplot supports this analysis and suggests that the values in the tail are outliers.
price_density <- ggdensity(raw_data,x="Price",y = "..count..",
color="darkblue",
fill="lightblue",size=0.5,
alpha=0.2,
title = "Screw Cap Price Distribution",
linetype = "solid", add = c("median"))+ font("title", size = 12,face="bold")
price_boxplot <- ggboxplot(raw_data$Price, width = 0.1, fill ="lightgray", outlier.colour = "darkblue", outlier.shape=4.2, ylab = "Price", xlab = "Screw Caps" , title = "Price Box Plot") + rotate() + font("title", size = 12,face="bold")
price_quantile <- quantile(raw_data$Price)
ggarrange(price_density, price_boxplot, ncol = 1, nrow = 2)
price_quantile
0% 25% 50% 75% 100%
6.477451 11.807022 14.384413 18.902429 46.610372
skewness(raw_data$Price)
[1] 1.706151
kurtosis(raw_data$Price)
[1] 6.395453
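To make the boxplot comment concrete, the sketch below recomputes the whisker limits from the printed quartiles, assuming the usual 1.5 x IQR rule (the default in ggplot2 boxplots). The four input values are copied from the `quantile()` output above; the variable names are ours, and the boxplot hinges may differ very slightly from these quantiles.

```r
# Quartiles and extremes of Price, copied from quantile(raw_data$Price) above.
q1 <- 11.807022
q3 <- 18.902429
price_min <- 6.477451
price_max <- 46.610372

iqr <- q3 - q1                   # interquartile range
lower_fence <- q1 - 1.5 * iqr    # points below this are drawn as outliers
upper_fence <- q3 + 1.5 * iqr    # points above this are drawn as outliers

round(c(IQR = iqr, lower = lower_fence, upper = upper_fence), 3)

# The minimum lies inside the fences, the maximum far outside them.
price_min < lower_fence   # FALSE
price_max > upper_fence   # TRUE
```

Under this rule only the right tail produces outliers, and the upper fence falls close to the antimode of about 29 read from the density plot.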
b) Does the Price depend on the Length? weight?
We examine Price vs. Length, log(Price) vs. log(Length); Price vs. weight, log(Price) vs. log(weight) and provide the summary for each.
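The plots suggest a positive relationship between Price and both Length and weight. The t and F tests in the four regression summaries below confirm that the slopes differ significantly from zero (p < 2.2e-16 in every case), with R-squared between 0.54 and 0.64. Note that the R-squared values of the raw and log-log models are not directly comparable, because the response variable differs (Price versus log(Price)).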
price_length <- ggplot(raw_data, aes(x=Length, y=Price)) + geom_point() + geom_smooth(method=lm, color="darkgreen")+ theme_minimal()
price_length_log <- ggplot(raw_data, aes(x=log(Length), y=log(Price))) + geom_point() + geom_smooth(method=lm, color="darkgreen")+ theme_minimal()
price_weight <- ggplot(raw_data, aes(x=weight, y=Price)) + geom_point() + geom_smooth(method=lm,color="red")+theme_minimal()
price_weight_log <- ggplot(raw_data, aes(x=log(weight), y=log(Price))) + geom_point() + geom_smooth(method=lm,color="red")+theme_minimal()
ggarrange(ggarrange(price_length, price_length_log, ncol = 2, nrow = 1), ggarrange(price_weight, price_weight_log, ncol = 2, nrow = 1), ncol = 1, nrow = 2)
summary(lm(formula = Price ~ Length, raw_data))
Call:
lm(formula = Price ~ Length, data = raw_data)
Residuals:
Min 1Q Median 3Q Max
-13.901 -2.854 -0.741 1.931 16.181
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.94613 0.50918 17.57 <2e-16 ***
Length 0.73168 0.03953 18.51 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.308 on 193 degrees of freedom
Multiple R-squared: 0.6397, Adjusted R-squared: 0.6378
F-statistic: 342.6 on 1 and 193 DF, p-value: < 2.2e-16
summary(lm(formula = log(Price) ~ log(Length), raw_data))
Call:
lm(formula = log(Price) ~ log(Length), data = raw_data)
Residuals:
Min 1Q Median 3Q Max
-0.70368 -0.15501 -0.01661 0.15170 0.59211
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.56380 0.07278 21.49 <2e-16 ***
log(Length) 0.53875 0.03282 16.42 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2466 on 193 degrees of freedom
Multiple R-squared: 0.5827, Adjusted R-squared: 0.5805
F-statistic: 269.5 on 1 and 193 DF, p-value: < 2.2e-16
summary(lm(formula = Price ~ weight, raw_data))
Call:
lm(formula = Price ~ weight, data = raw_data)
Residuals:
Min 1Q Median 3Q Max
-14.7993 -2.6207 -0.6631 2.5396 13.8357
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.2275 0.5602 14.69 <2e-16 ***
weight 4.8312 0.2718 17.78 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.419 on 193 degrees of freedom
Multiple R-squared: 0.6208, Adjusted R-squared: 0.6189
F-statistic: 316 on 1 and 193 DF, p-value: < 2.2e-16
summary(lm(formula = log(Price) ~ log(weight), raw_data))
Call:
lm(formula = log(Price) ~ log(weight), data = raw_data)
Residuals:
Min 1Q Median 3Q Max
-0.71123 -0.15340 -0.01343 0.17735 0.69552
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.50618 0.02333 107.42 <2e-16 ***
log(weight) 0.56453 0.03718 15.18 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2577 on 193 degrees of freedom
Multiple R-squared: 0.5443, Adjusted R-squared: 0.5419
F-statistic: 230.5 on 1 and 193 DF, p-value: < 2.2e-16
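The log-log slopes can be read as elasticities. The short sketch below uses only the two slopes printed in the summaries above; `price_multiplier` is a helper name introduced here for illustration.

```r
# Slopes of the two log-log fits, copied from the summaries above.
slope_length <- 0.53875   # log(Price) ~ log(Length)
slope_weight <- 0.56453   # log(Price) ~ log(weight)

# In a log-log model, multiplying the predictor by a factor f
# multiplies the fitted Price by f^slope.
price_multiplier <- function(slope, f) f^slope

round(price_multiplier(slope_length, 2), 3)   # effect of doubling Length
round(price_multiplier(slope_weight, 2), 3)   # effect of doubling weight
```

Doubling Length or weight is therefore associated with a fitted price roughly 45% to 48% higher, not twice as high: price grows less than proportionally with size.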
c) Does the Price depend on the Impermeability? Shape?
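The plots below suggest that Price depends on Impermeability: the median price of Type 2 caps is markedly higher than that of Type 1 caps. Price also appears to vary with Shape: in the `catdes` output later in this document, Shape 2 and Shape 4 caps have mean prices above the overall mean and Shape 1 caps below it. No formal test is performed at this stage.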
impermability_plot_1 <- ggdotplot(raw_data,x="Impermeability",y="Price",color = "Impermeability", palette = "jco",binwidth = 1,legend="none")
shape_plot_1 <- ggdotplot(raw_data,x="Shape",y="Price",color = "Shape", palette = "npg",binwidth = 1,legend="none")
impermability_plot_2 <- ggboxplot(raw_data,x="Impermeability",y="Price",color = "Impermeability", palette = "jco",legend="none")
shape_plot_2 <- ggboxplot(raw_data,x="Shape",y="Price",color = "Shape", palette = "npg", legend = "none")
ggarrange(ggarrange(impermability_plot_1,impermability_plot_2,ncol = 2, nrow = 1),
ggarrange(shape_plot_1,shape_plot_2,ncol = 2, nrow = 1),
ncol = 1, nrow = 2)
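A visual difference in medians can be backed by a test. The sketch below shows one possible check on synthetic data: a Kruskal-Wallis test and the share of variance explained (eta squared). Only the group sizes (172 and 23) come from the summary above; the prices are simulated, so the numbers it prints say nothing about the real data.

```r
set.seed(1)  # reproducible synthetic data, NOT the ScrewCaps data

# Hypothetical sample: group sizes mimic the Impermeability counts in the
# summary (172 Type 1, 23 Type 2); the prices themselves are simulated.
toy <- data.frame(
  Impermeability = factor(rep(c("Type 1", "Type 2"), times = c(172, 23))),
  Price = c(rnorm(172, mean = 15, sd = 5), rnorm(23, mean = 29, sd = 8))
)

# Rank-based test of whether the Price distribution differs between groups.
kw <- kruskal.test(Price ~ Impermeability, data = toy)

# Share of Price variance explained by the grouping (eta squared).
fit <- aov(Price ~ Impermeability, data = toy)
ss <- summary(fit)[[1]][["Sum Sq"]]   # between-group and residual sums of squares
eta2 <- ss[1] / sum(ss)

kw$p.value
eta2
```

On the real data, the same two calls with `data = raw_data` (and with `Shape` in place of `Impermeability`) would give the corresponding evidence.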
d) Which is the less expensive Supplier?
The answer to this question depends on the definition of expensive.
First, consider the following absolute metrics (these can be read from the boxplot):
1) Lowest single price - Supplier B is the cheapest (6.477451). However, Supplier B also has the highest single price (46.610372).
2) Average price - Supplier C is the cheapest (14.88869).
Second, consider the following relative metrics:
3) Average Price / Average Length - Supplier A (1.505043)
4) Average Price / Average weight - Supplier A (9.013902)
5) Average Price / Average Diameter - Supplier A (11.95632)
The results above suggest that Supplier A has the lowest average price per unit of length, weight and diameter.
The analysis is, however, incomplete, since we have no definition of "cheapest". The scatter and box plots below suggest that suppliers may cater to specific product ranges. The analysis also ignores the categorical variables, which could give insight into the cheapest supplier for particular product features. Furthermore, we have not performed statistical tests to assess the significance of these differences.
supplier_plot_1 <- ggboxplot(raw_data,x="Supplier",y="Price",color = "Supplier", palette = c("darkblue","red","darkgreen"),legend="none") + rotate()
supplier_plot_2 <- ggscatter(raw_data,x="Length",y="Price",color = "Supplier", palette = c("darkblue","red","darkgreen"),xscale= "log10", yscale="log10")
supplier_plot_3 <- ggscatter(raw_data,x="weight",y="Price",color = "Supplier", palette = c("darkblue","red","darkgreen"),xscale= "log10", yscale="log10")
supplier_plot_4 <- ggscatter(raw_data,x="Diameter",y="Price",color = "Supplier", palette = c("darkblue","red","darkgreen"),xscale= "log10", yscale="log10")
supplier_statistics <- raw_data %>% group_by(Supplier) %>% summarise( "Average Price" = mean(Price), "Average Length" = mean(Length),"Average weight" = mean(weight),"Average Diameter" = mean(Diameter), "Average Price / Length" = mean(Price)/mean(Length), "Average Price / weight" = mean(Price)/mean(weight), "Average Price / Diameter" = mean(Price)/mean(Diameter))
supplier_plot_1
supplier_plot_2
supplier_plot_3
supplier_plot_4
head(supplier_statistics)
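The relative metrics above are ratios of means (`mean(Price)/mean(Length)`), which is not the same as the mean of the per-cap ratios. The toy example below, with made-up numbers that are not from ScrewCaps.csv, shows that the two can disagree, so the choice should be stated when ranking suppliers.

```r
# Hypothetical toy data, not taken from ScrewCaps.csv.
toy <- data.frame(
  Supplier = c("A", "A", "B", "B"),
  Price    = c(10, 20, 12, 18),
  Length   = c(5, 20, 6, 6)
)

by_supplier <- split(toy, toy$Supplier)

# Ratio of means: the quantity computed in supplier_statistics above.
ratio_of_means <- sapply(by_supplier, function(d) mean(d$Price) / mean(d$Length))

# Mean of ratios: average the price per unit length of each cap instead.
mean_of_ratios <- sapply(by_supplier, function(d) mean(d$Price / d$Length))

rbind(ratio_of_means, mean_of_ratios)
```

For toy supplier A the two metrics are 1.2 and 1.5, while for B both equal 2.5: the ratio of means gives more influence to the larger caps.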
3) One important point in exploratory data analysis consists in identifying potential outliers. Could you give points which are suspect regarding the Mature.Volume variable? Give the characteristics (other features) of the observations that seem suspect.
There are four data points which seem suspect: they share the same values for Diameter, weight, nb.of.pieces, Impermeability, Finishing, Raw.Material and Mature.Volume. They differ in Supplier, Price and Length. This suggests an error in collating the data (a system error or default values).
Mature.Volume_plot <- gghistogram(raw_data,x="Mature.Volume",y="..count..", color = "darkblue", fill = "lightgrey") + theme_minimal()
Using `bins = 30` by default. Pick better value with the argument `bins`.
Mature.Volume_plot
raw_data %>% filter (Mature.Volume > 6e+05 )
For the rest of the analysis, the 4 data points above are disregarded.
library(dplyr)
raw_data <- raw_data %>% filter (Mature.Volume < 6e+05 )
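The observation that the suspect rows share identical feature values can be checked mechanically. The sketch below does so on a small made-up data frame (the rows and the choice of `key_cols` are illustrative assumptions, not the real data), using base R `duplicated()` in both directions.

```r
# Hypothetical toy rows, not the real data.
toy <- data.frame(
  Supplier      = c("A", "B", "C", "B", "A"),
  Diameter      = c(0.8, 1.2, 1.2, 1.2, 0.9),
  weight        = c(1.1, 1.6, 1.6, 1.6, 1.3),
  Mature.Volume = c(15000, 800000, 800000, 800000, 45000),
  Price         = c(12.1, 14.0, 15.2, 13.7, 16.4)
)

# Columns on which suspect rows are expected to coincide.
key_cols <- c("Diameter", "weight", "Mature.Volume")

# TRUE for every row whose key columns also appear in another row.
shares_key <- duplicated(toy[key_cols]) | duplicated(toy[key_cols], fromLast = TRUE)

# Rows that are both extreme in Mature.Volume and share their key values.
toy[shares_key & toy$Mature.Volume > 6e5, ]
```

Applied to `raw_data` with the seven columns named above, the same test would confirm whether the four high-volume rows really coincide on all of them.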
4) Perform a PCA on the dataset ScrewCap, explain briefly what are the aims of a PCA and how categorical variables are handled?
Principal components analysis (PCA) is a technique for taking high-dimensional data and using the dependencies between the variables to represent it in a more tractable, lower-dimensional form without losing too much information: we try to capture the essence of high-dimensional data in a low-dimensional representation. The aim of PCA is to draw conclusions from the linear relationships between variables by detecting the principal dimensions of variability. This may serve compression, denoising, data completion, anomaly detection or preprocessing before supervised learning (to improve performance, or as regularization to reduce overfitting).
Categorical variables cannot be represented in the same way as supplementary quantitative variables, since it is not possible to calculate the correlation between a categorical variable and the principal components. The categorical variables here are handled as supplementary variables on a purely illustrative basis: they are not used to calculate the distances between individuals. We represent each category at the barycentre of all the individuals possessing that category. A category on the PCA performed below can therefore be regarded as the mean individual obtained from the set of individuals who have it.
Given our ultimate goal here is to explore data prior to a multiple regression, it is advisable to choose the explanatory variables for the regression model as active variables for PCA, and to project the variable to be explained (the dependent variable) as a supplementary variable. This gives some idea of the relationships between explanatory variables and thus of the need to select explanatory variables. This also gives us an idea of the quality of the regression: if the dependent variable is appropriately projected, it will be a well-fitted model. Thus we select Price as a supplementary variable.
The dataset in this exercise contains 6 supplementary variables:
- 1 quantitative variable (Price)
- 5 qualitative variables (Supplier, Shape, Impermeability, Finishing and Raw.Material).
res.pca <- PCA(raw_data,quali.sup = c(1,5,6,7,9),quanti.sup = 10, graph = FALSE)
fviz_pca_ind(res.pca, col.ind="cos2", label=c("quali"), geom = "point", title = "Individual factor map (PCA)", habillage = "none") + scale_color_gradient2(low="lightblue", mid="blue", high="darkblue", midpoint=0.6) + theme_minimal()
plot.PCA(res.pca,choix = c("ind"),invisible = c("ind"))+theme_minimal()
NULL
plot.PCA(res.pca,choix = c("var"))+theme_minimal()
NULL
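To illustrate how supplementary variables are placed, here is a minimal base R sketch on synthetic data. It uses `prcomp` rather than FactoMineR, so coordinates may differ from `PCA()` by a scale factor; all variable names and simulated values are assumptions made for the example.

```r
set.seed(42)  # synthetic data, NOT the ScrewCaps data

n <- 60
grp <- factor(rep(c("Type 1", "Type 2"), times = c(45, 15)))
size <- rnorm(n, mean = ifelse(grp == "Type 2", 3, 1), sd = 0.4)

# Three active variables, two of them driven by a common "size" factor.
active <- data.frame(
  Diameter = size + rnorm(n, sd = 0.05),
  Length   = 8 * size + rnorm(n, sd = 0.5),
  pieces   = rpois(n, lambda = 4)
)
price <- 9 + 5 * size + rnorm(n, sd = 2)   # supplementary quantitative variable

# PCA on centred, standardised active variables only.
pca <- prcomp(active, center = TRUE, scale. = TRUE)
scores <- pca$x

# Supplementary category = barycentre of the individuals carrying it.
barycentres <- aggregate(scores[, 1:2], by = list(category = grp), FUN = mean)

# Supplementary quantitative variable = its correlation with each component.
price_coord <- cor(price, scores[, 1:2])

barycentres
price_coord
```

Neither `grp` nor `price` enters `prcomp`, so they cannot influence the axes; they are only positioned afterwards, which is what "supplementary" means here.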
5) Compute the correlation matrix between the variables and comment it with respect to the correlation circle
The first task is to center and standardize the variables. Then the correlation matrix is computed. All variable vectors are quite near to the boundary of the correlation circle on the variables plot, so the variables are relatively well projected on the 2-dimensional subspace. We now turn our attention to correlations between variables.
The correlations can be visualised through the angles between variables on the correlation circle. This can be related to the correlation matrix:
- Diameter, Length and weight exhibit very strong correlation: the angles between them are close to 0, suggesting correlations close to 1 (0.96 to 1.00 in the matrix).
- These three variables are at an angle slightly wider than a right angle to both nb.of.pieces and Mature.Volume in the circle, which suggests slightly negative correlation (-0.15 to -0.31 in the matrix).
- Price is highly correlated with these three variables.
- Equally, Mature.Volume and nb.of.pieces are at slightly more than a right angle to each other, which suggests slightly negative correlation (-0.07 in the matrix).
don <- as.matrix(raw_data[,-c(1,5,6,7,9,10)]) %>% scale()
don_correlation <- cor(don)
don_correlation
Diameter weight nb.of.pieces Mature.Volume Length
Diameter 1.0000000 0.9622544 -0.14869500 -0.29164724 0.9996963
weight 0.9622544 1.0000000 -0.16884367 -0.31321323 0.9627460
nb.of.pieces -0.1486950 -0.1688437 1.00000000 -0.07462463 -0.1463770
Mature.Volume -0.2916472 -0.3132132 -0.07462463 1.00000000 -0.2936330
Length 0.9996963 0.9627460 -0.14637705 -0.29363295 1.0000000
plot.PCA(res.pca,choix = c("var"))+theme_minimal()
NULL
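The link between correlation and angle can be made explicit: for standardised variables, the correlation is the cosine of the angle between them in the full space. The sketch below converts a few correlations copied from the matrix above into angles; on the 2-D circle the drawn angle matches these only to the extent that the variables are well projected.

```r
# Selected correlations copied from the matrix printed above.
r <- c(
  Diameter_Length        = 0.9996963,
  Diameter_weight        = 0.9622544,
  Diameter_Mature.Volume = -0.2916472,
  pieces_Mature.Volume   = -0.07462463
)

# Angle (in degrees) between two standardised variables in the full space.
angle_deg <- acos(r) * 180 / pi
round(angle_deg, 1)
```

Diameter and Length are under 2 degrees apart, while the negatively correlated pairs sit somewhat beyond 90 degrees, in line with the reading of the circle given above.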
6) On what kind of relationship PCA focuses? Is it a problem?
PCA focuses on the linear relationships between continuous variables. Since more complex links also exist (quadratic, logarithmic, exponential and so forth), this may seem restrictive, but in practice many relationships can be considered linear, at least as a first approximation. However, there are clearly non-linear datasets on which PCA performs poorly (e.g. a spiral dataset). Furthermore, in PCA categorical variables cannot be active variables, which can be restrictive.
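A small constructed example of the pitfall: below, `y` is completely determined by `x`, yet the linear correlation that PCA works with is zero (up to floating-point error). The data are made up for the illustration.

```r
x <- -5:5
y <- x^2          # y is fully determined by x, but not linearly

cor(x, y)                        # linear (Pearson) correlation
cor(x, y, method = "spearman")   # rank correlation does not detect it either
```

A PCA on these two variables would treat them as unrelated, even though one is a function of the other.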
7) Comment the PCA outputs
Comment the position of the categories Impermeability=type 2 and Raw.Material=PS.
The coordinates of Type 2 on the first two principal components are (3.28923043, 0.013320364). The coordinates of PS are (2.67437709, -0.253323291).
Both categories have a high coordinate on the first principal component. Given that the correlation circle shows high correlation between the first component and Price, Diameter, Length and weight, this suggests that Type 2 and PS caps have high values for these variables.
res.pca$quali.sup$coord
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
Supplier A 0.54805992 -0.054566515 -0.214051234 0.0227636641 0.0058684306
Supplier B -0.06543165 -0.125589918 -0.026949041 0.0018781980 -0.0006195266
Supplier C -0.44356100 1.440695488 0.728281700 -0.0670085402 -0.0056067533
Shape 1 -0.42564773 -0.137916238 -0.214123559 0.0065253174 -0.0009308749
Shape 2 1.42726960 0.394279456 0.383010989 -0.0314353290 0.0018793388
Shape 3 -0.55969671 -0.332207048 0.059604995 0.1360698967 -0.0029996610
Shape 4 -0.55191919 0.355523978 1.265466019 -0.0652825793 0.0075550964
Type 1 -0.45031131 -0.001823621 -0.009194259 0.0008200692 -0.0005392760
Type 2 3.28923043 0.013320364 0.067158065 -0.0059900708 0.0039390597
Hot Printing -0.28600503 -0.037712714 0.192161713 0.0717729126 -0.0006010224
Lacquering 0.13745978 0.018125491 -0.092356792 -0.0344955084 0.0002888635
ABS 0.87599666 0.220028373 -0.581512708 0.0043591307 -0.0032513149
PP -0.61062316 0.013651457 0.120658551 0.0042947466 -0.0005390562
PS 2.67437709 -0.253323291 -0.198579404 -0.0273071254 0.0056116038
Comment the percentage of inertia
res.pca$eig
eigenvalue percentage of variance cumulative percentage of variance
comp 1 3.1071215080 62.142430160 62.14243
comp 2 1.0669070766 21.338141532 83.48057
comp 3 0.7768681861 15.537363723 99.01794
comp 4 0.0488056018 0.976112036 99.99405
comp 5 0.0002976274 0.005952549 100.00000
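The percentages in this table follow directly from the eigenvalues. The sketch below recomputes them from the values printed above and applies the Kaiser rule (eigenvalue above 1) as one common, though not the only, way of choosing the number of components.

```r
# Eigenvalues copied from res.pca$eig above.
eig <- c(3.1071215080, 1.0669070766, 0.7768681861, 0.0488056018, 0.0002976274)

pct <- 100 * eig / sum(eig)   # share of total inertia per component
cum_pct <- cumsum(pct)        # cumulative share

round(data.frame(eigenvalue = eig, pct = pct, cum_pct = cum_pct), 2)

# Kaiser rule: components whose eigenvalue exceeds 1, the average
# eigenvalue when the variables are standardised.
which(eig > 1)
```

The eigenvalues sum to 5, the number of active variables. The first two components carry about 83.5% of the inertia and the first three about 99.0%, so the last two components are essentially redundant, as expected given the near-perfect correlation between Diameter and Length.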
drawn <- c("Type 1", "Type 2", "Supplier A", "PS", "PP", "Shape 2", "Shape 3", "Shape 1", "Lacquering", "Supplier B", "Supplier C","Hot Printing")
plot.PCA(res.pca, select = "cos2 5", axes = 1:2, choix = 'ind', invisible = c('ind', 'ind.sup'), title = '')
res.pca <- PCA(raw_data,quali.sup = c(1,5,6,7,9),quanti.sup = 10, ncp=3) #ncp = 3
res.hcpc <- HCPC(res.pca, nb.clust = -1, graph = FALSE)
Chi-squared approximation may be incorrect
drawn <- c("90", "89", "164", "161", "163", "131")
plot.HCPC(res.hcpc, choice = 'map', draw.tree = FALSE, select = "cos2 5", title = '')
dimdesc(res.pca, axes = 1:1)
$Dim.1
$Dim.1$quanti
correlation p.value
Length 0.9853764 3.259183e-147
Diameter 0.9851090 1.784008e-146
weight 0.9774643 1.263294e-129
Price 0.7960132 4.472456e-43
nb.of.pieces -0.2017085 5.139018e-03
Mature.Volume -0.4118157 3.243173e-09
$Dim.1$quali
R2 p.value
Impermeability 0.4767041 2.203784e-28
Raw.Material 0.4309747 9.602186e-24
Shape 0.2024825 3.268025e-09
$Dim.1$category
Estimate p.value
Type 2 1.8697709 2.203784e-28
PS 1.6944602 3.078822e-20
Shape 2 1.4547681 6.874053e-11
ABS -0.1039202 1.566216e-02
Shape 1 -0.3981492 5.692581e-07
PP -1.5905400 1.465743e-20
Type 1 -1.8697709 2.203784e-28
res.hcpc$desc.var
$test.chi2
p.value df
Impermeability 5.318642e-18 2
Raw.Material 5.226547e-17 4
Shape 5.626207e-06 6
Supplier 4.102258e-02 4
$category
$category$`1`
Cla/Mod Mod/Cla Global p.value v.test
Raw.Material=PP 25.000000 97.297297 75.392670 0.0001368886 3.813724
Impermeability=Type 1 22.023810 100.000000 87.958115 0.0049829303 2.808135
Supplier=Supplier C 0.000000 0.000000 7.329843 0.0434829294 -2.019041
Raw.Material=PS 3.846154 2.702703 13.612565 0.0222292828 -2.286427
Raw.Material=ABS 0.000000 0.000000 10.994764 0.0081544536 -2.645607
Impermeability=Type 2 0.000000 0.000000 12.041885 0.0049829303 -2.808135
Shape=Shape 2 2.222222 2.702703 23.560209 0.0002330416 -3.680210
$category$`2`
Cla/Mod Mod/Cla Global p.value v.test
Impermeability=Type 1 74.40476 93.984962 87.958115 0.0002868332 3.626910
Supplier=Supplier C 100.00000 10.526316 7.329843 0.0050545222 2.803538
Raw.Material=PP 74.30556 80.451128 75.392670 0.0173789787 2.378590
Raw.Material=PS 38.46154 7.518797 13.612565 0.0004703729 -3.497084
Impermeability=Type 2 34.78261 6.015038 12.041885 0.0002868332 -3.626910
$category$`3`
Cla/Mod Mod/Cla Global p.value v.test
Impermeability=Type 2 65.2173913 71.428571 12.04188 2.909966e-12 6.982005
Raw.Material=PS 57.6923077 71.428571 13.61257 4.169068e-11 6.597941
Shape=Shape 2 31.1111111 66.666667 23.56021 9.932485e-06 4.418638
Shape=Shape 1 5.3846154 33.333333 68.06283 6.715709e-04 -3.400930
Impermeability=Type 1 3.5714286 28.571429 87.95812 2.909966e-12 -6.982005
Raw.Material=PP 0.6944444 4.761905 75.39267 2.869539e-13 -7.300381
$quanti.var
Eta2 P-value
Length 0.8036361 3.530117e-67
Diameter 0.8025769 5.853470e-67
weight 0.8013378 1.053993e-66
Mature.Volume 0.7588382 8.649758e-59
Price 0.4812030 1.620918e-27
nb.of.pieces 0.1760507 1.243497e-08
$quanti
$quanti$`1`
v.test Mean in category Overall mean sd in category Overall sd p.value
Mature.Volume 11.942982 2.431183e+05 82206.026178 67166.762125 9.103190e+04 7.064414e-33
Diameter -3.255425 8.214269e-01 1.294639 0.254233 9.821218e-01 1.132228e-03
Length -3.297003 6.491733e+00 10.329589 2.056760 7.864783e+00 9.772253e-04
weight -3.536244 1.100262e+00 1.714121 0.315574 1.172854e+00 4.058595e-04
nb.of.pieces -3.780986 3.324324e+00 4.115183 1.274576 1.413225e+00 1.562083e-04
Price -3.857939 1.245686e+01 16.552332 4.115901 7.172431e+00 1.143473e-04
$quanti$`2`
v.test Mean in category Overall mean sd in category Overall sd p.value
nb.of.pieces 5.739229 4.503759 4.115183 1.352381e+00 1.413225e+00 9.510856e-09
Price -2.999298 15.521715 16.552332 4.620374e+00 7.172431e+00 2.706026e-03
weight -5.297059 1.416482 1.714121 3.882302e-01 1.172854e+00 1.176825e-07
Length -5.533133 8.244766 10.329589 2.492726e+00 7.864783e+00 3.145605e-08
Diameter -5.565997 1.032748 1.294639 3.121379e-01 9.821218e-01 2.606582e-08
Mature.Volume -8.031930 47177.255639 82206.026178 3.971314e+04 9.103190e+04 9.595124e-16
$quanti$`3`
v.test Mean in category Overall mean sd in category Overall sd p.value
Length 12.298789 30.295407 10.329589 7.979008e+00 7.864783e+00 9.194180e-35
Diameter 12.294570 3.787033 1.294639 1.000521e+00 9.821218e-01 9.687099e-35
weight 12.254018 4.680731 1.714121 1.164251e+00 1.172854e+00 1.598746e-34
Price 9.282815 30.295414 16.552332 8.814239e+00 7.172431e+00 1.650583e-20
Mature.Volume -3.281665 20542.857143 82206.026178 1.547128e+04 9.103190e+04 1.031962e-03
nb.of.pieces -3.659694 3.047619 4.115183 7.221786e-01 1.413225e+00 2.525166e-04
attr(,"class")
[1] "catdes" "list "
Fisher test Variance
Comment the results and describe precisely one cluster – Add Fisher Test
Cluster 1 is made of individuals sharing:
- high values for the variable Mature.Volume (mean 243,118 against an overall mean of 82,206);
- low values for the variables Price, nb.of.pieces, weight, Length and Diameter (variables are sorted from the weakest).
Cluster 2 is made of individuals sharing:
- high values for the variable nb.of.pieces;
- low values for the variables Mature.Volume, Diameter, Length, weight and Price (variables are sorted from the weakest).
Cluster 3 is made of individuals such as 89, 90, 131, 161, 163 and 164. This group is characterized by:
- high values for the variables Length, Diameter, weight and Price (variables are sorted from the strongest); the mean Price is 30.30 against an overall mean of 16.55;
- low values for the variables nb.of.pieces and Mature.Volume (variables are sorted from the weakest);
- an over-representation of Impermeability=Type 2 and Raw.Material=PS (each 71.4% of the cluster, against 12.0% and 13.6% of all caps).
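The v.test values used to sort the variables can be reproduced by hand. The sketch below does this for Price in cluster 3, assuming the usual definition of the v.test (cluster mean compared with the overall mean, scaled by the standard error of a mean drawn without replacement). The cluster size of 21 is not printed in the output: it is inferred here from Mod/Cla = 71.43% = 15/21, and should be treated as an assumption.

```r
# Values copied from res.hcpc$desc.var$quanti for Price in cluster 3.
mean_in_cluster <- 30.295414
overall_mean    <- 16.552332
overall_sd      <- 7.172431

# Sizes are NOT printed above; they are assumptions inferred from the output:
# 191 rows remain after filtering, and Mod/Cla = 71.43% = 15/21 suggests
# 21 individuals in cluster 3.
N   <- 191
n_q <- 21

# v.test: gap between cluster mean and overall mean, in units of the
# standard error of a mean of n_q values drawn without replacement.
se <- sqrt((overall_sd^2 / n_q) * ((N - n_q) / (N - 1)))
v_test <- (mean_in_cluster - overall_mean) / se
v_test
```

The result agrees with the printed v.test of 9.28 for Price in cluster 3; values beyond about 2 in absolute value correspond to p-values below 0.05.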
If someone asks you why you have selected k components and not k + 1 or k − 1, what is your answer? (Could you suggest a strategy to assess the stability of the approach? Are there many differences between the clustering obtained on k components and on the initial data?)
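One possible strategy, sketched here on synthetic data, is to cluster on the first k components for several values of k, cluster on the full standardised data, and compare the partitions with the adjusted Rand index. This is only an outline under stated assumptions: it uses base R Ward clustering without the k-means consolidation that HCPC applies, a fixed number of three clusters, and helper names (`ward_cut`, `adjusted_rand`) introduced for the example.

```r
set.seed(7)  # synthetic data, NOT the ScrewCaps data

# Three hypothetical groups of 40 points in five dimensions.
centres <- rbind(c(0, 0, 0, 0, 0), c(6, 6, 0, 0, 0), c(0, 6, 6, 0, 0))
X <- do.call(rbind, lapply(1:3, function(g) {
  matrix(rnorm(40 * 5), ncol = 5) + matrix(centres[g, ], 40, 5, byrow = TRUE)
}))
Z <- scale(X)

# Ward clustering cut into k groups.
ward_cut <- function(M, k = 3) cutree(hclust(dist(M), method = "ward.D2"), k = k)

# Adjusted Rand index between two partitions (1 = identical up to labels).
adjusted_rand <- function(a, b) {
  tab <- table(a, b)
  sum_ij <- sum(choose(tab, 2))
  sum_a <- sum(choose(rowSums(tab), 2))
  sum_b <- sum(choose(colSums(tab), 2))
  expected <- sum_a * sum_b / choose(length(a), 2)
  (sum_ij - expected) / ((sum_a + sum_b) / 2 - expected)
}

scores <- prcomp(Z)$x
reference <- ward_cut(Z)   # clustering on all standardised variables

# Agreement between the clustering on 2, 3 or 4 components and the reference.
ari <- sapply(2:4, function(k) adjusted_rand(ward_cut(scores[, 1:k, drop = FALSE]), reference))
names(ari) <- paste0("ncp=", 2:4)
ari
```

If the index stays close to 1 across neighbouring values of k, the partition does not hinge on the exact number of components retained; a sharp drop at k − 1 would argue for keeping k.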
res.pca <- PCA(raw_data,quali.sup = c(1,5,6,7,9),quanti.sup = 10,ncp=4)
res.hcpc <- HCPC(res.pca, nb.clust = -1)
res.pca <- PCA(raw_data,quali.sup = c(1,5,6,7,9),quanti.sup = 10,ncp=3)
res.hcpc <- HCPC(res.pca, nb.clust = -1)
res.pca <- PCA(raw_data,quali.sup = c(1,5,6,7,9),quanti.sup = 10,ncp=2)
res.hcpc <- HCPC(res.pca, nb.clust = -1)
res.hcpc <- HCPC(res.pca, nb.clust = -1, graph = FALSE)
Chi-squared approximation may be incorrect
res.hcpc
**Results for the Hierarchical Clustering on Principal Components**
name description
1 "$data.clust" "dataset with the cluster of the individuals"
2 "$desc.var" "description of the clusters by the variables"
3 "$desc.var$quanti.var" "description of the cluster var. by the continuous var."
4 "$desc.var$quanti" "description of the clusters by the continuous var."
5 "$desc.var$test.chi2" "description of the cluster var. by the categorical var."
6 "$desc.axes$category" "description of the clusters by the categories."
7 "$desc.axes" "description of the clusters by the dimensions"
8 "$desc.axes$quanti.var" "description of the cluster var. by the axes"
9 "$desc.axes$quanti" "description of the clusters by the axes"
10 "$desc.ind" "description of the clusters by the individuals"
11 "$desc.ind$para" "parangons of each clusters"
12 "$desc.ind$dist" "specific individuals"
13 "$call" "summary statistics"
14 "$call$t" "description of the tree"
plot.HCPC(res.hcpc, choice = 'map', draw.tree = FALSE, title = '', select=c("12"))
res.pca <- PCA(raw_data,quali.sup = c(1,5,6,7,9),quanti.sup = 10,ncp=3)
res.hcpc <- HCPC(res.pca, nb.clust = -1, graph = FALSE)
Chi-squared approximation may be incorrect
Characterization of each supplier
catdes(raw_data, num.var=1)
Chi-squared approximation may be incorrect
$test.chi2
p.value df
Raw.Material 9.049049e-05 4
Impermeability 1.088731e-02 2
$category
$category$`Supplier A`
Cla/Mod Mod/Cla Global p.value v.test
Raw.Material=PS 42.30769 37.93103 13.61257 0.0002998155 3.615459
Impermeability=Type 2 34.78261 27.58621 12.04188 0.0130149176 2.483361
Shape=Shape 2 26.66667 41.37931 23.56021 0.0213728107 2.301333
Raw.Material=ABS 0.00000 0.00000 10.99476 0.0254288561 -2.234825
Impermeability=Type 1 12.50000 72.41379 87.95812 0.0130149176 -2.483361
$category$`Supplier B`
Cla/Mod Mod/Cla Global p.value v.test
Raw.Material=ABS 100.00000 14.18919 10.99476 0.003330616 2.935453
Raw.Material=PS 57.69231 10.13514 13.61257 0.015928453 -2.410551
Shape=Shape 2 60.00000 18.24324 23.56021 0.002374481 -3.038894
$category$`Supplier C`
Cla/Mod Mod/Cla Global p.value v.test
Raw.Material=PP 9.722222 100 75.39267 0.01626019 2.403023
$quanti.var
Eta2 P-value
nb.of.pieces 0.2137072 1.530822e-10
$quanti
$quanti$`Supplier A`
NULL
$quanti$`Supplier B`
v.test Mean in category Overall mean sd in category Overall sd p.value
nb.of.pieces -2.817845 3.959459 4.115183 1.240523 1.413225 0.004834708
$quanti$`Supplier C`
v.test Mean in category Overall mean sd in category Overall sd p.value
nb.of.pieces 6.345875 6.428571 4.115183 1.720228 1.413225 2.211654e-10
attr(,"class")
[1] "catdes" "list "
catdes(raw_data, num.var=5)
Chi-squared approximation may be incorrect
$test.chi2
p.value df
Impermeability 2.873602e-16 3
Raw.Material 8.762044e-07 6
Finishing 1.040072e-02 3
$category
$category$`Shape 1`
Cla/Mod Mod/Cla Global p.value v.test
Impermeability=Type 1 76.785714 99.2307692 87.95812 1.043033e-11 6.800436
Raw.Material=PP 72.222222 80.0000000 75.39267 3.596432e-02 2.097331
Finishing=Lacquering 72.868217 72.3076923 67.53927 4.420603e-02 2.012132
Finishing=Hot Printing 58.064516 27.6923077 32.46073 4.420603e-02 -2.012132
Raw.Material=PS 30.769231 6.1538462 13.61257 3.360691e-05 -4.147538
Impermeability=Type 2 4.347826 0.7692308 12.04188 1.043033e-11 -6.800436
$category$`Shape 2`
Cla/Mod Mod/Cla Global p.value v.test
Impermeability=Type 2 95.65217 48.88889 12.04188 2.151940e-15 7.932260
Raw.Material=PS 69.23077 40.00000 13.61257 1.004609e-07 5.325888
Supplier=Supplier A 41.37931 26.66667 15.18325 2.137281e-02 2.301333
Supplier=Supplier B 18.24324 60.00000 77.48691 2.374481e-03 -3.038894
Raw.Material=PP 16.66667 53.33333 75.39267 2.022645e-04 -3.716171
Impermeability=Type 1 13.69048 51.11111 87.95812 2.151940e-15 -7.932260
$category$`Shape 3`
NULL
$category$`Shape 4`
Cla/Mod Mod/Cla Global p.value v.test
Finishing=Hot Printing 9.677419 75 32.46073 0.0169336 2.388146
Finishing=Lacquering 1.550388 25 67.53927 0.0169336 -2.388146
$quanti.var
Eta2 P-value
Price 0.24285191 2.771217e-11
Diameter 0.23221716 9.994081e-11
Length 0.23112294 1.139178e-10
weight 0.19722569 5.965369e-09
nb.of.pieces 0.10533516 1.120672e-04
Mature.Volume 0.05693699 1.177220e-02
$quanti
$quanti$`Shape 1`
v.test Mean in category Overall mean sd in category Overall sd p.value
nb.of.pieces -3.721118 3.853846 4.115183 1.2716527 1.4132247 1.983430e-04
weight -5.069026 1.418671 1.714121 0.7867181 1.1728539 3.998565e-07
Length -5.469752 8.191771 10.329589 5.0759423 7.8647827 4.506649e-08
Diameter -5.489199 1.026728 1.294639 0.6306647 0.9821218 4.037612e-08
Price -6.344431 14.290942 16.552332 4.8726895 7.1724314 2.232495e-10
$quanti$`Shape 2`
v.test Mean in category Overall mean sd in category Overall sd p.value
Diameter 6.616403 2.143782 1.294639 1.409033 9.821218e-01 3.680436e-11
Length 6.603559 17.116281 10.329589 11.260640 7.864783e+00 4.014035e-11
Price 6.176070 22.340911 16.552332 9.620363 7.172431e+00 6.571699e-10
weight 6.118100 2.651800 1.714121 1.698673 1.172854e+00 9.469770e-10
nb.of.pieces 2.865929 4.644444 4.115183 1.607698 1.413225e+00 4.157871e-03
Mature.Volume -2.005008 58355.222222 82206.026178 68318.473831 9.103190e+04 4.496223e-02
$quanti$`Shape 3`
NULL
$quanti$`Shape 4`
v.test Mean in category Overall mean sd in category Overall sd p.value
nb.of.pieces 3.078997 5.62500 4.115183 9.921567e-01 1.413225 0.002076987
Mature.Volume 2.629122 165250.00000 82206.026178 1.132649e+05 91031.901051 0.008560561
Price 2.044271 21.63988 16.552332 3.822103e+00 7.172431 0.040926790
attr(,"class")
[1] "catdes" "list "
catdes(raw_data, num.var=6)
Chi-squared approximation may be incorrect
$test.chi2
p.value df
Raw.Material 4.088669e-21 2
Shape 2.873602e-16 3
Supplier 1.088731e-02 2
$category
$category$`Type 1`
Cla/Mod Mod/Cla Global p.value v.test
Shape=Shape 1 99.23077 76.785714 68.06283 1.043033e-11 6.800436
Raw.Material=PP 97.91667 83.928571 75.39267 1.773212e-11 6.723573
Supplier=Supplier A 72.41379 12.500000 15.18325 1.301492e-02 -2.483361
Raw.Material=PS 30.76923 4.761905 13.61257 5.429478e-15 -7.816541
Shape=Shape 2 51.11111 13.690476 23.56021 2.151940e-15 -7.932260
$category$`Type 2`
Cla/Mod Mod/Cla Global p.value v.test
Shape=Shape 2 48.8888889 95.652174 23.56021 2.151940e-15 7.932260
Raw.Material=PS 69.2307692 78.260870 13.61257 5.429478e-15 7.816541
Supplier=Supplier A 27.5862069 34.782609 15.18325 1.301492e-02 2.483361
Raw.Material=PP 2.0833333 13.043478 75.39267 1.773212e-11 -6.723573
Shape=Shape 1 0.7692308 4.347826 68.06283 1.043033e-11 -6.800436
$quanti.var
Eta2 P-value
Diameter 0.47062626 6.604215e-28
Length 0.46804072 1.049429e-27
weight 0.45675032 7.728264e-27
Price 0.43301606 4.512224e-25
Mature.Volume 0.07171395 1.801495e-04
$quanti
$quanti$`Type 1`
                 v.test Mean in category Overall mean sd in category   Overall sd      p.value
Mature.Volume  3.691294     91225.988095 82206.026178   9.338486e+04 9.103190e+04 2.231162e-04
Price         -9.070449        14.805996    16.552332   4.819967e+00 7.172431e+00 1.185272e-19
weight        -9.315716         1.420835     1.714121   6.707159e-01 1.172854e+00 1.211330e-20
Length        -9.430150         8.338742    10.329589   4.357114e+00 7.864783e+00 4.095012e-21
Diameter      -9.456161         1.045344     1.294639   5.411724e-01 9.821218e-01 3.194554e-21
$quanti$`Type 2`
                 v.test Mean in category Overall mean sd in category   Overall sd      p.value
Diameter       9.456161         3.115573     1.294639       1.449522 9.821218e-01 3.194554e-21
Length         9.430150        24.871426    10.329589      11.600832 7.864783e+00 4.095012e-21
weight         9.315716         3.856391     1.714121       1.708742 1.172854e+00 1.211330e-20
Price          9.070449        29.308174    16.552332       8.516118 7.172431e+00 1.185272e-19
Mature.Volume -3.691294     16321.086957 82206.026178   13496.587327 9.103190e+04 2.231162e-04
attr(,"class")
[1] "catdes" "list "
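To make the v.test column easier to read, the Diameter row for Type 2 can be recomputed by hand. This is a sketch under stated assumptions: the v.test is taken to be the gap between the category mean and the overall mean, divided by a standard error with a finite-population correction, and the sample sizes (191 caps in total, 23 of Type 2) are inferred from the printed percentages rather than shown directly. The variable names are illustrative only.

```r
# Values copied from the Diameter row of $quanti$`Type 2`
mean_cat <- 3.115573    # mean in category
mean_all <- 1.294639    # overall mean
sd_all   <- 0.9821218   # overall sd

# Assumed sample sizes, inferred from the output (not printed directly)
n_all <- 191
n_cat <- 23

# Standard error of the category mean under sampling without replacement
se <- sqrt(sd_all^2 / n_cat * (n_all - n_cat) / (n_all - 1))

# Standardised gap and its two-sided normal p-value
v_test  <- (mean_cat - mean_all) / se
p_value <- 2 * pnorm(-abs(v_test))

# For a variable with only two categories, Eta2 in $quanti.var
# appears to equal v.test^2 / (N - 1)
eta2 <- v_test^2 / (n_all - 1)

c(v.test = v_test, p.value = p_value, Eta2 = eta2)
```

A |v.test| above roughly 1.96 corresponds to a p-value below 0.05, which is why every row listed for Type 1 and Type 2 is flagged as characteristic of its category.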
prediction(p)
'newdata' had 1 row but variables found have 191 rows
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
27.69272 34.70339 34.69759 20.11863 14.81027 15.40691 19.43986 22.45541 22.49163 22.54425 16.58928 13.68044 17.24941 20.55373 21.55533 17.30082 12.72796 10.87910 10.70474 13.42236
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
13.73555 13.74849 15.80902 15.90384 11.46448 13.46473 22.25682 15.87875 13.42327 13.20923 17.55902 13.78838 13.78414 13.94349 15.91431 15.77545 17.47118 17.57738 13.91997 13.76739
41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
13.77252 15.87151 15.79862 15.75066 17.91152 11.68277 25.68110 10.94533 10.92800 10.99362 13.54678 21.52712 15.86714 15.57777 17.64044 16.25265 16.17517 13.53198 13.47022 14.68320
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
14.87049 14.89404 13.99963 15.08266 15.38655 15.00378 13.35540 14.59563 10.76404 15.04518 16.01022 12.94006 12.72004 11.16594 14.86233 13.47580 10.30001 14.60612 10.78620 13.12756
81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
13.12973 14.49869 14.35313 41.44650 41.69060 16.88254 20.21065 11.37100 11.61257 17.48260 17.57313 17.58805 17.53195 12.25359 12.28111 12.58934 21.85734 23.31011 11.29687 21.39414
101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
22.73885 10.93979 24.75645 10.15192 12.30480 17.40617 15.90527 12.98452 17.07700 16.12695 13.46384 13.48858 12.33844 15.00702 12.37759 14.98899 15.15678 14.17156 15.55844 15.32873
121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
14.11369 15.38907 16.85773 17.31336 10.77652 36.37604 36.36689 17.42748 14.39460 13.13081 13.21011 14.75478 14.34834 10.83483 14.57894 15.22337 15.33934 15.28413 15.33725 15.38926
141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
11.37071 11.18142 12.43299 10.71426 10.66502 10.63671 10.69176 14.49358 14.56753 14.53136 14.15525 15.68536 14.70337 16.16348 17.19244 37.88701 33.28545 33.67779 39.66891 36.32280
161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
18.92520 13.42249 14.10767 14.08426 14.17276 12.00040 27.02591 14.81076 15.26243 14.22626 14.11547 14.13328 14.14734 14.36611 16.46547 15.02260 13.92552 18.96046 16.76339 17.95008
181 182 183 184 185 186 187 188 189 190 191
16.74113 18.62731 21.55976 14.99084 14.98708 15.05771 14.97594 17.40471 17.34680 17.38971 17.40890
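The warning printed with these predictions ("'newdata' had 1 row but variables found have 191 rows") usually means that `predict()` could not find the model's variables in `newdata` and fell back to the original data, which would explain why 191 fitted values come back instead of one. The body of `prediction()` is not shown here, so the following is only a guess at the cause, illustrated on made-up data: a model fitted with `df$column` terms reproduces the same warning, while a model fitted with bare column names and a `data` argument does not. The data frame `toy` and all values in it are hypothetical.

```r
set.seed(1)

# Made-up stand-in for the screw caps data (hypothetical values)
toy <- data.frame(Length = runif(20, min = 3, max = 40))
toy$Price <- 6 + 0.9 * toy$Length + rnorm(20)

new_cap <- data.frame(Length = 10)   # one new observation to predict

# Problematic fit: the term is literally named `toy$Length`,
# so it is looked up in `toy`, not in `newdata`
bad_fit <- lm(toy$Price ~ toy$Length)
bad_msg <- tryCatch(
  {
    predict(bad_fit, newdata = new_cap)
    NA_character_
  },
  warning = function(w) conditionMessage(w)
)
bad_len <- length(suppressWarnings(predict(bad_fit, newdata = new_cap)))

# Safer fit: bare column names plus `data =`, so `newdata` is matched by name
good_fit  <- lm(Price ~ Length, data = toy)
good_pred <- predict(good_fit, newdata = new_cap)

bad_msg            # warning text from the problematic fit
bad_len            # number of values returned by the problematic fit
length(good_pred)  # number of values returned by the safer fit
```

If `prediction()` wraps a model built this way, refitting with `Price ~ ...` and `data = raw_data` and passing a `newdata` frame whose column names match the predictors should return a single prediction per new row.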
res.hcpc.famd$call$X$Dim.1
[1] -1.98063919 -1.83264787 -1.83106329 -1.78617733 -1.77810618 -1.76793474 -1.76710263 -1.75740608 -1.74971627 -1.74854492 -1.72234966 -1.66587800 -1.64757197 -1.64122934
[15] -1.63598508 -1.56390795 -1.53944334 -1.53334618 -1.48308272 -1.38978669 -1.31431198 -1.31360993 -1.30659672 -1.30106267 -1.24853337 -1.22972350 -1.22095845 -1.21456405
[29] -1.18418458 -1.18021014 -1.17935581 -1.17927296 -1.17545877 -1.16352170 -1.15631274 -1.15536248 -1.14893262 -1.14368481 -1.13904799 -1.11948077 -1.11592461 -1.11166467
[43] -1.10864477 -1.07786759 -1.07465242 -1.07231358 -1.05375226 -1.03073238 -1.01177639 -1.00275909 -1.00244847 -0.99953402 -0.99685597 -0.99557798 -0.99270885 -0.98055457
[57] -0.98041153 -0.97369797 -0.97319443 -0.96309915 -0.94797090 -0.93951350 -0.92381405 -0.92380588 -0.92312060 -0.92280368 -0.92215711 -0.91733190 -0.91127725 -0.90468216
[71] -0.88924490 -0.88886587 -0.88556501 -0.87235273 -0.86644337 -0.86479528 -0.83275001 -0.81147499 -0.77426760 -0.77398705 -0.77256095 -0.76784626 -0.76735493 -0.76500699
[85] -0.75906301 -0.74956767 -0.74385898 -0.72805741 -0.71957456 -0.70612214 -0.67838613 -0.67821432 -0.64991811 -0.64920144 -0.64269013 -0.63822303 -0.63582084 -0.63163680
[99] -0.62753059 -0.62599476 -0.62186560 -0.60480936 -0.58138172 -0.57247016 -0.56818608 -0.55787880 -0.55368923 -0.54139477 -0.53854440 -0.53480620 -0.52898642 -0.52824415
[113] -0.52616669 -0.52564460 -0.52454183 -0.52449868 -0.51962886 -0.51480527 -0.50739799 -0.49352372 -0.48049116 -0.47854325 -0.47511958 -0.45213593 -0.44366084 -0.41417752
[127] -0.40516410 -0.39864924 -0.39340526 -0.38440624 -0.38336093 -0.38001794 -0.37853964 -0.35220981 -0.34431429 -0.32536820 -0.31982441 -0.31803255 -0.30039211 -0.28421095
[141] -0.28417967 -0.27830495 -0.27577096 -0.26247151 -0.25453002 -0.24784546 -0.23849067 -0.23173017 -0.23034051 -0.22769867 -0.15845660 -0.05646856 -0.05377382 0.05889003
[155] 0.07158955 0.07913132 0.16276548 0.79185578 1.06680161 1.09360971 1.14437872 1.21748198 1.25067158 1.37109701 2.03126674 2.20958460 2.25060435 2.48114155
[169] 2.49540111 2.51611986 2.68062272 3.05674906 3.07969012 3.23130366 3.77350671 3.80195508 4.13207845 4.32016658 4.35027614 4.72999750 5.26696337 5.37357116
[183] 5.86802720 5.87862818 6.58628131 6.60222800 6.60582995 7.20144721 7.28664755 7.39353843 7.91577973